Workshop, University of Venice
2024-11-01
Statistician by training, got to where I am through medical, environmental and industrial applications.
Teaching Fellow at Imperial College London
Privacy, fairness and explainability in ML.
Really it all comes down to doing good statistics well.
Protecting Sensitive Data: ML models often train on personal or confidential data (health records, financial info), and safeguarding this data is essential to prevent misuse or unauthorized access.
Compliance with Regulations: Laws like GDPR require organisations to protect user privacy, impacting how data is collected, stored, and used in ML.
Preventing Data Leakage: Models can unintentionally expose sensitive information from their training data, risking user privacy if someone exploits the model’s outputs.
Building Trust: Privacy-conscious ML practices foster trust among users, making them more willing to share data and participate in systems that use ML.
Avoiding Discrimination: Privacy techniques can reduce bias and discrimination risks, ensuring the ML model treats users fairly without targeting sensitive attributes.
Reality is much messier. See Gelman and Loken (2013) for a discussion of the implications.
Standard ML assumes that data are cheap and easy to collect. Out-of-the-box model fitting assumes we are working with big, representative and independent samples.
Applications of ML to social science to study hard-to-reach populations: persecuted groups, stigmatised behaviours.
Standard study designs and analysis techniques will fail.
By using a subject-driven sampling design, we can better explore the hard-to-reach target population while preserving the privacy of data subjects who do not want to be included in the study.
Even if data subjects are easy to access and sample from, they may not wish to answer honestly.
Can you give me some examples?
Dishonest answers will bias any subsequent analysis, leading us to underestimate the prevalence of an “undesirable outcome”.
(Interesting intersection with survey design and psychology. The order and way that you ask questions can influence responses but we will focus on a single question here.)
\[\Pr(Y_i = 1) = \theta \quad \text{and} \quad \Pr(Y_i = 0) = 1 - \theta.\] Method of Moments Estimator: (General dataset)
\[ \hat \Theta = \hat\Pr(Y_i = 1)\]
\[ \hat \Theta = \frac{1}{n}\sum_{i=1}^n \mathbb{I}(Y_i = 1) = \bar Y = \frac{\#\{yes\}}{\#\{subjects\}}.\]
Method of Moments Estimate: (Specific dataset) \[ \hat \theta = \frac{1}{n}\sum_{i=1}^n \mathbb{I}(y_i = 1) = \bar y.\]
Suppose I ask 100 people whether they have ever been unfaithful in a romantic relationship and 24 people respond “Yes”.
What is your best guess of the proportion of all people who have been unfaithful?
\(\hat\theta = \bar y = \frac{24}{100}\)
How confident are you about that guess?
Would that change if I had 1 person responding “Yes”?
Would that change if I had 99 people responding “Yes”?
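The estimate and its uncertainty for the example above can be computed directly. This is a minimal sketch using the normal-approximation standard error; the counts are those from the survey example.

```r
# Method of moments estimate and standard error for the survey example.
n     <- 100
n_yes <- 24

theta_hat <- n_yes / n                             # \bar{y} = 0.24
se_hat    <- sqrt(theta_hat * (1 - theta_hat) / n) # approx 0.043

# Extreme responses (1 or 99 "Yes") give smaller estimated standard errors,
# but the normal approximation is less trustworthy near the boundary:
sqrt((1 / 100) * (99 / 100) / 100)                 # approx 0.010
```

Note that the estimated standard error depends on the estimate itself, which is part of why intuition about confidence changes as the observed proportion moves toward 0 or 1.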
Over lots of samples we get it right on average:
\(\mathbb{E}_Y[\hat\Theta] = \mathbb{E}_Y\left[\frac{1}{n}\sum_{i=1}^n \mathbb{I}(Y_i = 1)\right] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_Y\left[ \mathbb{I}(Y_i = 1)\right] = \frac{n \theta}{n} = \theta\)
As the number of samples grows, the variance of our estimator shrinks and we recover the truth:
\[\begin{align*} \text{Var}_Y[\hat\Theta] &= \text{Var}_Y\left[\frac{1}{n}\sum_{i=1}^n \mathbb{I}(Y_i = 1)\right] \\ &= \frac{1}{n^2}\sum_{i=1}^n\text{Var}_Y\left[\mathbb{I}(Y_i = 1)\right] \\ &= \frac{1}{n^2}\sum_{i=1}^n \theta(1-\theta) \\ &= \frac{n \theta (1-\theta)}{n^2} \\ &= \frac{\theta (1-\theta)}{n} \rightarrow 0 \end{align*}\]
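Both properties can be checked by simulation. A quick sketch, with illustrative values of \(\theta\) and \(n\): repeated samples average to \(\theta\), and the sampling variance tracks \(\theta(1-\theta)/n\).

```r
# Simulation check of unbiasedness and shrinking variance of the sample mean.
# The value of theta and the sample sizes are illustrative.
set.seed(1)
theta <- 0.24

for (n in c(10, 100, 1000)) {
  theta_hats <- replicate(5000, mean(rbinom(n, size = 1, prob = theta)))
  cat("n =", n,
      "| mean of estimates =", round(mean(theta_hats), 3),
      "| empirical var =", signif(var(theta_hats), 2),
      "| theoretical var =", signif(theta * (1 - theta) / n, 2), "\n")
}
```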
Mathematically nice but in reality people lie.
Our estimator worked “best” for central values of \(\theta\), which are unlikely for stigmatised events.
Add random element to survey to provide plausible deniability.
MoM estimation: Equate probabilities and proportions.
Consider using a weighted coin: with probability \(p\) the subject tells the truth, and otherwise they lie. Derive an expression for the probability of answering “Yes”.
\[\begin{align*} \Pr(\text{Yes}) &= \theta p + (1 - \theta)(1 - p) \\ & \approx \frac{\#\{yes\}}{\#\{subjects\}} \\ &= \bar y \end{align*}\]
Rearrange this expression to get a formula for \(\hat \theta\).
\[\hat \theta = \frac{\bar y - 1 + p}{2p -1}.\]
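The randomised response design and its estimator can be checked by simulation. A sketch with illustrative values: each subject tells the truth with probability \(p\) and lies otherwise, yet the rearranged estimator still recovers the true prevalence.

```r
# Sketch of the randomised response design. Parameter values are illustrative.
set.seed(42)
n     <- 10000
theta <- 0.24   # true prevalence of the sensitive trait
p     <- 0.8    # probability a subject answers truthfully

truth    <- rbinom(n, size = 1, prob = theta)   # true status
truthful <- rbinom(n, size = 1, prob = p)       # coin flip: truth or lie?
response <- ifelse(truthful == 1, truth, 1 - truth)

y_bar     <- mean(response)
theta_hat <- (y_bar - 1 + p) / (2 * p - 1)
theta_hat   # close to 0.24, despite every individual answer being deniable
```

Note the price of privacy: the \((2p - 1)\) denominator inflates the variance of \(\hat\theta\) relative to asking directly, and the design breaks down entirely at \(p = 1/2\).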
Approach to privacy for single, binary response.
Issues with applying to multiple questions, e.g. surveys with follow-on questions.
Extensions to categorical and continuous responses and predictors
General principle of adding noise
Collect what you need and use that information only for its intended purpose.
Targeting hard-to-reach populations can be challenging but possible by combining survey design and specific learning approaches. Keeps statisticians in a job!
Asking difficult questions can lead to biased responses. Plausible deniability through randomised response designs can help.
Once we have gone to the effort of collecting data we don’t want to just leave it lying around for anyone to access.
\[ \text{Plain text} \overset{f}{\rightarrow} \text{Cipher Text}\]
\[ \text{Cipher text} \overset{f^{-1}}{\rightarrow} \text{Plain Text}\]
\[ f(\text{data}, \text{key})\] There are many encryption schemes, depending on the data to be encrypted and on how the key is to be distributed.
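The plain text \(\rightarrow\) cipher text \(\rightarrow\) plain text round trip can be made concrete with a toy scheme. This is a minimal Caesar-shift sketch, where the key is the shift amount; real schemes (e.g. AES) are far stronger, and the function name here is my own.

```r
# Toy symmetric encryption: shift lower-case letters by `key` positions.
# Encryption is f(data, key); decryption applies the negated key.
caesar <- function(text, key) {
  chars <- utf8ToInt(text)
  is_lower <- chars >= utf8ToInt("a") & chars <= utf8ToInt("z")
  chars[is_lower] <- (chars[is_lower] - utf8ToInt("a") + key) %% 26 + utf8ToInt("a")
  intToUtf8(chars)
}

cipher <- caesar("attack at dawn", key = 3)   # "dwwdfn dw gdzq"
caesar(cipher, key = -3)                      # recovers "attack at dawn"
```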
Write your own short message and pass it to a friend to decode.
What are some benefits and drawbacks of this encryption scheme?
What happens if someone gets access to the data?
\(k\)-anonymity is a measure of privacy within a dataset.
Given a set of predictor-outcome responses, each unique combination forms an equivalence class.
The smallest equivalence class of a \(k\)-anonymous dataset is of size \(k\).
Equivalently, each individual is indistinguishable from at least \(k-1\) others.
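Under this definition, \(k\) can be computed directly: count the rows in each unique combination of quasi-identifiers and take the smallest count. A sketch, with a made-up example dataset and a helper function of my own naming:

```r
# k is the size of the smallest equivalence class over the quasi-identifiers.
k_anonymity <- function(data, quasi_identifiers) {
  counts <- table(do.call(paste, data[quasi_identifiers]))
  min(counts)
}

# Illustrative data: two equivalence classes, of sizes 2 and 3.
survey <- data.frame(
  age_band = c("18-25", "18-25", "26-40", "26-40", "26-40"),
  region   = c("North", "North", "South", "South", "South")
)

k_anonymity(survey, c("age_band", "region"))  # 2
```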
I asked ChatGPT to generate 4-anonymous datasets but it hasn’t done a good job.
Establish the true value of \(k\) for your dataset.
Use pseudonymisation, aggregation, redaction and partial-redaction to make your dataset 4-anonymous.
What do you think some of the limitations of \(k\)-anonymity might be?
Don’t leave important data lying around unprotected.
Choose a level of security appropriate to the sensitivity of the data.
Consider the consequences of someone gaining access to the data.
Remember that your data does not live in isolation.
\(k\)-anonymity is not a strong measure of privacy, but it is an accessible starting point.
R version 4.3.3 (2024-02-29)
Platform: x86_64-apple-darwin20 (64-bit)
locale: en_US.UTF-8, en_US.UTF-8, en_US.UTF-8, C, en_US.UTF-8, en_US.UTF-8
attached base packages: stats, graphics, grDevices, utils, datasets, methods and base
loaded via a namespace (and not attached): compiler(v.4.3.3), fastmap(v.1.1.1), cli(v.3.6.3), tools(v.4.3.3), htmltools(v.0.5.8.1), rstudioapi(v.0.16.0), yaml(v.2.3.8), Rcpp(v.1.0.12), pander(v.0.6.5), rmarkdown(v.2.26), knitr(v.1.45), jsonlite(v.1.8.8), xfun(v.0.43), digest(v.0.6.35), rlang(v.1.1.4), png(v.0.1-8) and evaluate(v.0.23)
Privacy by Design - November 2024 - Zak Varty